ABSTRACT

The use of scientific calculators on standardized
mathematics tests is becoming more common, and how their use affects
the standardized testing process continues to be a topic of study.
This study extends previous work by examining test item
characteristics and equity issues, asking whether items designed to
be "calculator neutral" function as intended, and whether such items
function the same for males and females. Data came from trials of the 
Midwestern Mathematics Placement Examination for precalculus and 
calculus classes for over 1,000 college students. A logistic 
regression analysis was conducted to detect differential item 
functioning (DIF) or nonuniform DIF. Findings suggest that calculator 
neutral items can be constructed, and that the items that were 
constructed to be calculator neutral were largely free from gender 
DIF. Results also showed that the logistic regression approach was a 
useful addition to studying DIF, although larger sample sizes would 
have been highly preferable. Using the rule-space approach as a basis 
for the content analyses was also quite effective. An appendix 
contains two sample problems and a description of the mathematical 
challenge items. (Contains 4 tables, 4 figures, and 17 references.) 
Detecting DIF on mathematics items: The case for gender and calculator sensitivity

INTRODUCTION 

The use of scientific calculators on standardized mathematics tests has become more
common since the National Council of Teachers of Mathematics issued its recommendation on this
issue (NCTM, 1989). For example, both the ATP Mathematics Level II Achievement Test and the
mathematics portion of the Scholastic Assessment Test (SAT), published by Educational Testing
Service, permit testtakers to use calculators when answering the test items. Consequently, how the
use of calculators affects the standardized testing process continues to be a topic of
investigation. Issues studied include effects on test speededness (Ansley, Spratt, & Forsyth,
1989; Harvey, Jackson, & Facher, 1993; Loyd, 1991) and sensitivity of items to calculator use
(Cohen & Kim, 1992; Harvey et al., 1993).

The results of research on test speededness are mixed; minimal and moderate speededness
effects were found (Ansley et al., 1989; Bridgman, Harvey, & Braswell, 1995; Harvey et al., 1993;
Loyd, 1991). The findings on item sensitivity to calculator use are more clear-cut. These come
from studies of items classified as inactive (no advantage or disadvantage in calculator use),
neutral (the item can be solved with or without a calculator), or active (a calculator is
necessary or helpful in answering the question). Examinees who used calculators had an advantage
on neutral and active items, while calculators were used infrequently on inactive items (Harvey et
al., 1993). In a study by Cohen and Kim (1992), of the 28 items on the test, 5 to 12 items were
susceptible to calculator effects, depending on the method used (subscore or item-level analyses).

Other areas of investigation include the effects of calculator use on item and test
characteristics and gender differences (Ansley et al., 1989; Cohen & Kim, 1992; Harvey et al., 1993;
Loyd, 1991). Cohen and Kim found no significant difference in test scores
between the use and non-use of calculators on a college placement exam; however, several calculator
effects were detected in item-level analyses. In a study involving high school students, no
advantage to calculator use was detected even though 19 out of 25 items required low-level
computation (Ansley et al., 1989). Loyd (1991) found a decrease in coefficient alpha for items
administered with calculators when comparing the performance of items answered with and
without calculators.

Using an ANCOVA design (controlling on achievement), Bridgman et al. found that males and
females benefited equally from calculator use at the test score level. However, in the Harvey et al.
investigation, the Mantel-Haenszel procedure was used to detect differential item functioning for
males and females. Females found all types of calculator items (active, inactive, neutral)
differentially more difficult than males. The mean MH D-DIF values by item type ranged from
moderate (M = -.25, SD = .59) for calculator inactive items to substantial (M = -.48, SD = .50) for
calculator neutral items.

This study seeks to extend the work of Harvey et al. by further examining item
characteristics and equity issues. The investigation is designed to address two questions. First, do
items designed to be calculator neutral (i.e., the type of item where calculator use might be
helpful, but the item can be solved without a calculator) function as intended? Second, do
calculator neutral items function the same for males and females?

METHODS 

Tests 
The Midwestern Mathematics Placement Examination (MMPE) pilot items are based on course
content covered in pre-calculus college courses. While all incoming freshmen with three years of
high school mathematics will be administered the test, the purpose of the test is to place students
in a pre-calculus course and a first-semester calculus course. The test is a 'low stakes assessment.'
Students are not required to follow course placement recommendations based on the MMPE test
score results. Nevertheless, accurate course placement is useful and efficient for students, faculty,
and the institution (Ryan & Fan, 1993). Fairness is also a concern, particularly in light of recent
research which suggests that female performance in college mathematics courses is
underpredicted by college entrance examinations like the Scholastic Aptitude Test-Mathematics
(SAT-M) (Linn & Kessel, 1995; Wainer & Steinberg, 1993).

The previous version of the MMPE was adequate for placing students in appropriate
courses; however, the exam resembled an aptitude test and did not reflect modern instructional
approaches to mathematics like calculator use. Consequently, the pilot test was designed to allow
calculator use and was composed of algebra, trigonometry, and geometry items. However, to
avoid some of the standardization issues surrounding calculator use in tests, like equal access to
calculators, test items were designed to be primarily calculator neutral.

Forty-six items were randomly assigned to two pilot test forms, each with twenty-three
items: Forms A and B. The test instructions for Forms A and B indicated no calculator use was
allowed. Two other test forms were assembled: Forms C and D. These forms were identical to
Form A and Form B, respectively. However, the test instructions for Forms C and D permitted
ordinary scientific calculator use when answering the test items. Test instructions indicated
students were allowed 40 minutes to complete the test.
Design and Sample



Data used in this investigation were collected from two item trials: Spring 1995 and Fall
1995. The test forms were administered in pre-calculus and calculus classes. For both item trials,
test booklets were spiraled to create equivalent groups for data collection. The Spring 1995
study sample consisted of 346 undergraduates in pre-calculus and introductory calculus courses at
a large midwestern university. The sample size for the test forms ranged from 82 students
completing Form D to 94 students completing Form A.

Any item which was not functioning as intended according to the content and/or statistical
review was either deleted, revised, or re-classified. The Fall study involved a large-scale
administration of the revised test items, which took place on August 25, 1995. Over one thousand
testtakers in pre-calculus and calculus courses participated, with the number of examinees
completing each form ranging from 249 for Form B to 316 for Form A. Item data used in this
study are from items that were common to both item pilots and were functioning as intended.
Consequently, seventeen items from Forms A and C and twelve items from Forms B and D were
retained for further analyses.

Analyses: 

A logistic regression analysis was conducted for each test item to detect uniform DIF
(after controlling on achievement, the probability of answering the item under study correctly is
greater for one group in comparison to another) or non-uniform DIF (the difference in the
probability of answering the item correctly for matched groups of test takers is not the same for
all achievement levels) (Swaminathan & Rogers, 1990). Logistic regression has the following
formulation:
\[ P(u_i = 1) = \frac{e^{z_i}}{1 + e^{z_i}}, \qquad z_i = \tau_0 + \tau_1 G_i + \tau_2 X_i + \tau_3 (X_i G_i), \]

where $i$ is the index for the testtaker; $u_i$ is the testtaker's item response, scored 1 (correct) or 0
(incorrect); $X_i$ is the testtaker's total score; and $G_i$ is the testtaker's group: $G_i = 1$ if the examinee is
a member of the focal group, 2 if the examinee is a member of the reference group.
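
As a minimal sketch of the formulation above, the following Python function evaluates the probability of a correct response with z = tau0 + tau1*G + tau2*X + tau3*(X*G). The function names and all parameter values are hypothetical, chosen only for illustration; they are not estimates from this study. The sketch uses the literal 1/2 group codes from the definition; software that recodes the group variable internally would change the sign convention of the group and interaction parameters.

```python
import math

def p_correct(x, g, tau0, tau1, tau2, tau3):
    """P(u = 1) = e^z / (1 + e^z), with z = tau0 + tau1*g + tau2*x + tau3*x*g."""
    z = tau0 + tau1 * g + tau2 * x + tau3 * x * g
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def group_gap(x, tau0, tau1, tau2, tau3):
    """Log-odds for the focal group (g = 1) minus the reference group (g = 2) at score x."""
    focal = p_correct(x, 1, tau0, tau1, tau2, tau3)
    reference = p_correct(x, 2, tau0, tau1, tau2, tau3)
    return logit(focal) - logit(reference)

# Hypothetical parameter values (not estimates from the study).
uniform = dict(tau0=-2.0, tau1=0.5, tau2=0.3, tau3=0.0)
nonuniform = dict(tau0=-2.0, tau1=0.5, tau2=0.3, tau3=-0.05)

for x in (4, 8, 12):
    print(x, round(group_gap(x, **uniform), 3), round(group_gap(x, **nonuniform), 3))
```

With tau3 = 0 the gap between the groups in log-odds is the same at every total score, which is the uniform DIF case; a nonzero tau3 makes the gap depend on the score, which is the non-uniform case.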

The SAS-PC PROC CATMOD procedure was used for each comparison (e.g., calculator/no
calculator) to estimate the parameters $\tau_1$, $\tau_2$, $\tau_3$. The dependent variable, test item response, was
coded as 0 (incorrect) or 1 (correct). The total score on the MMPE pilot items (X, achievement)
was designated as the continuous independent variable or covariate. Members of the reference
group (the standard against which the performance of the focal group is compared) were assigned a '2'; focal group
members (the subgroup of interest) were assigned a '1' for the group membership variable.
If the sign of the estimate of the parameters ($\tau_1$ or $\tau_3$) is positive, the focal group is favored;
otherwise, the reference group is favored. Models were evaluated with a chi-square statistic with
1 degree of freedom. Calculations for logistic regression are described in detail elsewhere
(Swaminathan & Rogers, 1990).
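
The 1-degree-of-freedom chi-square evaluation can be sketched as follows. The text does not say whether the statistic was a Wald or a likelihood-ratio chi-square; this sketch assumes a Wald statistic, (estimate / standard error) squared, and the estimate and standard error used here are hypothetical values, not results from the study.

```python
import math

def wald_chi_square(estimate, std_error):
    """Wald statistic for a single parameter: (estimate / SE)^2."""
    return (estimate / std_error) ** 2

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))

# Hypothetical estimate and standard error, for illustration only.
chi2 = wald_chi_square(-0.50, 0.20)
print(round(chi2, 2), round(p_value_1df(chi2), 4))
```

The identity used in `p_value_1df` holds because a chi-square variate with 1 degree of freedom is the square of a standard normal variate.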

Analyses Design

Two basic models were tested sequentially with a backwards procedure to examine
gender, calculator use (CNC), and achievement effects. Table 1 provides a description of the
logistic regression analyses that were conducted. To test for non-uniform DIF, the interaction
parameter ($\tau_3$) of a three-way interaction model was tested (Set 1). (See Camilli & Shepard, 1994, for details of
this formulation.) To examine the effects of calculator use, the model included an interaction term
for calculator use by right/wrong by achievement. To investigate gender effects, the interaction
term for gender by right/wrong by achievement was tested. To detect uniform DIF, the items free
from non-uniform DIF were examined with a simpler model. With this model, the group parameter ($\tau_1$) is
tested (Set 2). This model consisted of total score, calculator use, and gender as independent
variables; the gender and calculator use parameters for each item were inspected.

Each of the 17 items was tested for Forms A and C, and 12 items were tested for Forms B
and D; the criterion was the total score on the test items for all analyses. The studied item was
included in the criterion. Test takers completing Form C (calculators) were specified as the
reference group; examinees answering Form A (no calculators) were designated as the focal
group. To investigate gender effects, the corresponding forms were combined (e.g., test forms A
and C): males from A and males from C served as the reference group; females from test forms A
and C were specified as the focal group. Parallel analyses were conducted with the examinees'
responses to Forms D and B of the test to replicate findings from Forms C and A.

— Insert Table 1 — 

Content analyses were conducted for any DIF items identified. Traditionally, content analyses of
DIF items are based on Bloom's taxonomy or inspection (Nandakumar, 1993; Ryan & Fan,
1994). Instead, the content analyses are based on the Rule-Space approach developed by K.
Tatsuoka (1993). This approach was adapted for reporting the math proficiencies for the new
SAT-M (Harnisch, Tatsuoka, & Wilkins, 1995). Items are inspected in relationship to a set of
attributes, which are the cognitive skills necessary to answer the test question correctly. (See
Appendix A for a list of attributes called math challenges.)
RESULTS



The tests were designed to be parallel in content and difficulty. A summary of the
descriptive statistics for the total sample, by form and by gender, is reported in Table 2. The A and
C (AC) combined version of the test is somewhat easier than the B and D (BD) combined test
form (M = 7.16 for 17 items versus M = 4.14 for twelve items). There are minimal differences
between the corresponding calculator and non-calculator forms of the test (less than .15 s.d. for
A-C and B-D forms). There are modest differences in test performance between males and
females (less than .4 s.d. for the AC and BD forms).

— Insert Table 2 —

Table 3 presents the results for the reliability analyses. The coefficient alpha estimates for
the matching calculator and no-calculator forms were approximately .65 for Form AC and .61 for
Form BD. However, when the estimates were calculated separately, the reliability estimates for the
forms which allowed calculator use (Forms C and D) were slightly higher (.63 vs. .67 and .58 vs.
.64). However, these differences are not statistically significant (z = .24 for Forms A and C; z =
.26 for Forms B and D; p > .05). Estimates for the Spearman-Brown prophecy formula were also
calculated for 40 items and 45 items. The estimates ranged from .87 for Form D to .84 for Form
C. The differences in the Spearman-Brown reliability estimates for the calculator and non-calculator
versions (40 and 45 items) of the tests were not statistically significant (not reported).
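
The Spearman-Brown projections can be sketched with the standard prophecy formula, r' = k r / (1 + (k - 1) r), where k is the ratio of the projected test length to the current length. The function name is hypothetical, and the sketch assumes the reported reliabilities refer to the 17 retained items (Forms A and C) and the 12 retained items (Forms B and D).

```python
def spearman_brown(reliability, n_items, target_items):
    """Projected reliability when a test of n_items is lengthened to target_items."""
    k = target_items / n_items  # lengthening factor
    return k * reliability / (1.0 + (k - 1.0) * reliability)

# Reliability estimates and item counts as reported for the combined forms.
print(round(spearman_brown(0.648, 17, 40), 3))  # Forms A and C projected to 40 items
print(round(spearman_brown(0.608, 12, 40), 3))  # Forms B and D projected to 40 items
```

Under these assumptions the two printed values agree with the 40-item projections listed for the combined forms in Table 3 (0.812 and 0.838).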

— Insert Table 3 —

The results for the logistic regression analyses are presented in Table 4. The parameter
estimate for the three-way interaction term was not statistically significant for any items from the
test forms. However, uniform DIF was detected for 4 items (Set 2 analyses). Two items were
found to be differentially functioning when the items were tested using the simpler model for
Forms A and C. The gender main effect was statistically significant for items 7 and 9; these items
are differentially easier for male test takers.

— Insert Table 4 —

Plots of the score distributions and the probability of a correct response for male and
female test takers for items 7 and 9 are presented in Figures 1 and 2. As shown in Figures 1
and 2, the probability of a correct response on these items is not the same for male and female
testtakers at the same achievement levels. For example, for students with a total score of 5 or 6,
the probability of men answering item 9 correctly is approximately .6.
In contrast, the probability of women with the same total score answering this question correctly is
around .43.
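
To put the gap between these two probabilities on the log-odds scale used by logistic regression, they can be converted to an odds ratio. This is a small illustrative calculation using only the two approximate probabilities quoted above; the variable names are hypothetical, and the result is not a model estimate from the study.

```python
import math

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Approximate probabilities of a correct response on item 9 quoted in the text.
p_men, p_women = 0.60, 0.43

odds_ratio = odds(p_men) / odds(p_women)
log_odds_gap = math.log(odds_ratio)
print(round(odds_ratio, 2), round(log_odds_gap, 2))
```

An odds ratio near 2 means that, at this score level, the odds of a correct answer for men are roughly twice those for women.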

Based on an analysis of the attributes (attributes 1, 2, 3, 12), question 7 involves a function with a
second degree algebraic expression. However, if testtakers did not know how to solve the function, they
could use a test-taking strategy and work backwards, comparing the values to find the answer. The
results from this item suggest women may be weaker in test-taking skills. (See Appendix A for the text of
item #7.) In order to solve the other problem (9), testtakers need to know how to translate word problems
into an algebraic expression and restructure the problem into a solvable form (attributes 5 and 6).

— Insert Figures 1 and 2 here —

Item 1 on Forms B and D is also differentially more difficult for females (see Figure 3).
In addition, the calculator version of item 2 is differentially easier for test takers. Figure 4 presents
a plot of the probability of a correct response and total test score. As shown in Figure 4, the
probability of getting item 2 correct is higher for the students who used a calculator to answer the
question (Form D).



For item 1, testtakers need to know the meaning of 'average' and apply the property of average to
restructure the story problem into a solvable form (attributes 1, 5, 6). (See Appendix A for the text of item
#1.) Item 2 is a geometry problem. Examinees need to know the meanings of slope and intercept and how
to add two roots (attributes 1 and 4). Perhaps calculator use helped students to avoid arithmetic errors in
calculating the addition of two factors.

— Insert Figures 3 and 4 here —

DISCUSSION 

The purpose of this study was to investigate whether the MMPE test items intended to be
'calculator neutral' perform as expected and whether these kinds of items are neutral for both
males and females. The findings of this study suggest that calculator neutral items can be
constructed. Furthermore, these items were largely free from gender DIF. However, a more
interesting question, whether the items are differentially more difficult or easier for females
depending on the use of a calculator, was not investigated because of sample size requirements
when using logistic regression. The study has several other limitations also.

The logistic regression approach is a useful addition to studying DIF. The flexibility of
this approach, which provides the opportunity to investigate DIF in relationship to variables like
ability, calculator use, and gender in combination, is a distinct advantage. However, there is a cost
with this flexibility, especially for smaller testing programs. Findings from simulation studies
suggest that with sample sizes of 250 per group, DIF is detected with 75% accuracy when using
logistic regression (Swaminathan & Rogers, 1990). To attain 100% accuracy, sample sizes of 500
for each group are required. To investigate the question of different conditioning variables like
calculator use, achievement, and gender [...] those available in this investigation [...].

[...] test items were clearly multidimensional in at least [...] respects: content
and problem type. In that current analytic methods available for exploring DIF
are largely unidimensional, interpretation of results for tests of this type is
troublesome. For example, items [...] which were differentially easier for males were story
problems. The test specifications require story problem type items and test questions which are
not story problems. However, there will not be enough items of each type [...] treated as a [...].
[...] and his colleagues are expecting to release a version of SIBTEST specifically
[...] differential bundle functioning (bundles of logically-linked items
which do not function as intended) for tests designed to be multidimensional. This should be
especially useful for examining tests like the one used in this investigation.

[...] using the rule-space approach [...] items identified as DIF in this investigation [...]
Linn and Kessel [...] that these differences may [...]
females find story problems differentially more difficult than males (items [...] Form
BD). There are similar findings from other gender DIF research with a similar population, SAT-M
testtakers (Harris & Carlton, 1993). Results from other studies suggest [...] selection and
[...] ([...]; Ryan & Fan, 1994).
Both logistic regression and the rule-space approach as a basis for content analyses can contribute to
further understanding of gender and math performance. In combination, these approaches may provide the
opportunity to tease out effects that may be linked to testing and teaching practices in mathematics (Linn &
Kessel, 1995). This would be a major step forward: the possibility of understanding DIF, not just identifying
it.
Table 1

DIF Analysis Design

Analyses               Model                                   Groups Compared                  DIF Type

Set 1: Full model      Item = Score + Gender + Score*Gender    Males/Females and achievement;   Non-uniform
                       Item = Score + CNC + Score*CNC          C/NC and achievement

Set 2: Simpler model   Item = Score + Gender                   Males/Females; C/NC              Uniform
                       Item = Score + CNC

Note. CNC means calculator/no calculator use allowed.
Table 2

Descriptive Statistics for MMPE Pilot

Group             N      Mean    Std Dev   Min     Max

Forms A and C
  Males           303    7.57    3.09      0       16.00
  Females         262    6.68    2.89      0       16.00
  No calculator   316    6.97    2.95      0       15.00
  Calculator      249    7.41    3.12      1.00    16.00
  Overall         565    7.16    3.03      0       16.00

Forms B and D
  Males           297    4.54    2.45      0       12.00
  Females         254    3.67    2.12      0       12.00
  No calculator   307    3.97    2.27      0       11.00
  Calculator      244    4.35    2.42      0       12.00
  Overall         551    4.14    2.34      0       12.00
Table 3

Reliability Analyses for the MMPE Pilot Test: KR-20 and Spearman-Brown Prophecy

                     Spearman-Brown Prophecy
Form       KR-20     40 items    45 items

A and C    0.648     0.812       0.829
A          0.629     0.800       0.818
C          0.667     0.825       0.842
B and D    0.608     0.838       0.854
B          0.583     0.823       0.840
D          0.635     0.853       0.867
Table 4

Summary of Logistic Regression Analysis for Set 2 Analyses

Form    Item    Effect    P-value    Parameter Estimate    Favors

A&C     7       Gender    0.016      -0.468                Male
        9       Gender    0.002      -0.583                Male
B&D     1       Gender    0.001      -0.621                Male
        2       CNC       0.030      -0.486                Calculator

Note. CNC means calculator allowed and no calculator allowed. For gender,
females are the focal group and males are the reference group. For CNC, the
group that used the calculator is the reference group; the focal group did not
use a calculator.
Figure 1. Plot of the probability of a correct response for item 7 (Forms A and C) and total score
for males and females.

Figure 2. Plot of the probability of a correct response for item 9 (Forms A and C) and total score
for males and females.

Figure 3. Plot of the probability of a correct response for item 1 (Forms B and D) and total score
for males and females.

Figure 4. Plot of the probability of a correct response for item 2 and total score for Form B and
Form D testtakers.
APPENDIX A

Forms A and C

7. The smallest possible value of f(x) = (x-5)^2 - 5 is
(a) -10 (b) -5 (c) 0 (d) 20 (e) 5

Forms B and D

1. Sam received grades of 87, 75, and 72 on three math tests. What average does he need on the next two
tests in order to average 80 on all five?

(a) 81 (b) 82 (c) 83 (d) 85 (e) 86
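
The two sample items can be checked with a short sketch. It assumes item 7 reads f(x) = (x - 5)^2 - 5, consistent with the description of the item as a second degree expression, and it mimics the work-backwards strategy by testing which answer options the function can actually attain. The variable names are hypothetical, and the answer keys are derived here rather than taken from the test materials.

```python
# Item 7 (assumed form): f(x) = (x - 5)**2 - 5.
# Working backwards: a value v is attainable only if (x - 5)**2 = v + 5
# has a real solution, that is, if v + 5 >= 0.
options_7 = {"a": -10, "b": -5, "c": 0, "d": 20, "e": 5}
attainable = {key: v for key, v in options_7.items() if v + 5 >= 0}
answer_7 = min(attainable, key=attainable.get)  # smallest attainable option

# Item 1: three scores are known; five tests must average 80.
scores = [87, 75, 72]
needed_total = 80 * 5 - sum(scores)   # points still required on two tests
needed_average = needed_total / 2

print(answer_7, needed_average)
```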
Description of Mathematical Challenges: